feat: [DSM-142] Use CanisterStates in ReplicatedState #10287
alin-at-dfinity wants to merge 23 commits
Lays the foundation for splitting `ReplicatedState::canister_states` into
"hot" (potentially active) and "cold" (definitely idle) pools, so that
per-round operations can skip the long tail of idle canisters.
This PR is intentionally a no-op for the running replica: it only adds
the new types and predicates. The integration into `ReplicatedState` and
the migration of all consumers follow in subsequent PRs.
Specifically:
* `CanisterState::is_cold()` — pure predicate that classifies a canister
as "definitely idle": no input/output, no task queue entries, no
heartbeat method, inactive global timer, not `Stopping`, no
unexpired best-effort callbacks, and no scheduler debits.
* `CallContextManager::has_unexpired_callbacks()` and the matching
`SystemState::has_unexpired_callbacks()` accessor, used by `is_cold`.
* `CanisterStates`, a hot/cold-partitioned container with eager
promotion (mutations land in `hot`) and lazy demotion (via
`try_cool`/`try_cool_all`), plus the common map operations
(`get`/`get_mut`/`insert`/`remove`/`contains_key`/`len`/`is_empty`/
`retain`), per-pool iterators (`hot_iter`/`hot_values`/
`hot_values_mut`), merged iterators in `CanisterId` order
(`all_iter`/`all_keys`/`all_values`), and bulk mutation
(`for_each_mut`/`try_for_each_mut`).
* `CanisterStates::validate_strict_split()` for the canonical-partition
invariant used in checkpoint validation.
* `debug_assert_invariants()` runs on every mutating operation in
debug builds.
`ColdStats` and the aggregate accessors (`total_compute_allocation`,
`total_canister_memory_usage`, `memory_taken`, `callback_count`, ...)
are intentionally **not** part of this PR — they will be added once the
struct is in place.
Co-authored-by: Cursor <cursoragent@cursor.com>
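The eager-promotion / lazy-demotion discipline described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `HotCold`, `Canister`, `pending_messages`, the `u64` IDs and the `hot_len`/`cold_len` helpers are hypothetical stand-ins, and `is_cold()` is reduced to a single condition.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for `CanisterState`; it only tracks pending work.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Canister {
    pub pending_messages: usize,
}

impl Canister {
    /// Simplified analogue of `is_cold()`: nothing left to do.
    pub fn is_cold(&self) -> bool {
        self.pending_messages == 0
    }
}

/// Toy hot/cold container keyed by a numeric canister ID (assumption).
#[derive(Debug, Default)]
pub struct HotCold {
    hot: BTreeMap<u64, Canister>,
    cold: BTreeMap<u64, Canister>,
}

impl HotCold {
    /// Mutations land in `hot`; demotion only happens in `try_cool_all`.
    pub fn insert(&mut self, id: u64, canister: Canister) {
        self.cold.remove(&id);
        self.hot.insert(id, canister);
    }

    /// Eager promotion: a cold entry moves to `hot` before `&mut` is handed out.
    pub fn get_mut(&mut self, id: u64) -> Option<&mut Canister> {
        if let Some(canister) = self.cold.remove(&id) {
            self.hot.insert(id, canister);
        }
        self.hot.get_mut(&id)
    }

    /// Lazy demotion: move every hot canister that is currently cold.
    pub fn try_cool_all(&mut self) {
        let idle: Vec<u64> = self
            .hot
            .iter()
            .filter(|(_, canister)| canister.is_cold())
            .map(|(id, _)| *id)
            .collect();
        for id in idle {
            if let Some(canister) = self.hot.remove(&id) {
                self.cold.insert(id, canister);
            }
        }
    }

    pub fn hot_len(&self) -> usize {
        self.hot.len()
    }

    pub fn cold_len(&self) -> usize {
        self.cold.len()
    }
}
```

In this sketch, a caller that obtains `get_mut` for a cold canister pays for one map move up front, and the container never has to guess whether the caller actually changed anything; the cost of re-checking is deferred to `try_cool_all`.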
Maintains a small `ColdStats` aggregate over the canisters in the `cold` pool, updated incrementally on every transition into / out of `cold`. This lets the "touch every canister" aggregate queries (`total_compute_allocation`, `total_canister_memory_usage`, `memory_taken`, `callback_count`, `guaranteed_response_message_memory_taken`, `best_effort_message_memory_taken`) run in `O(|hot|)` instead of `O(|all canisters|)`, which is the primary motivation for the hot/cold split on subnets with a long tail of idle canisters.
The aggregates are derived (not persisted) and are reconstructed by `CanisterStates::new` on checkpoint load. `debug_assert_invariants` now also runs an `O(|cold|)` recompute and compares the result against the live aggregate, which ensures that every mutating method keeps the two in sync. The `ColdStats` struct stays module-private: callers always reach the totals through the public aggregator methods on `CanisterStates`.
`MemoryTaken`'s fields are bumped from private to `pub(crate)` so that `CanisterStates::memory_taken` can construct the struct directly, keeping `MemoryTaken` in its current home in `replicated_state.rs`. `CanisterStates::memory_taken` itself is `pub(crate)` and will be wired up to `ReplicatedState::memory_taken` in the next PR; an `#[allow(dead_code)]` keeps the build warning-free until then.
Aggregator behaviour is exercised by two new tests (`memory_aggregators_combine_hot_and_cold`, `callback_count_combines_hot_and_cold`), and the bookkeeping discipline is exercised by an extended set of `*_updates_cold_stats*` tests covering `insert`, `remove`, `try_cool*`, `for_each_mut`, `try_for_each_mut`, and `retain`.
Co-authored-by: Cursor <cursoragent@cursor.com>
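The incremental bookkeeping can be illustrated with a small sketch. It assumes a single tracked quantity (`memory_bytes`) and hypothetical `Pools`, `Canister` and `active` names; the real `ColdStats` tracks several totals and its field layout is not shown in this PR text.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a canister; `active` replaces the full `is_cold()` check.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Canister {
    pub memory_bytes: u64,
    pub active: bool,
}

/// Running total over the cold pool, adjusted on every cold transition.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct ColdStats {
    memory_bytes: u64,
}

impl ColdStats {
    fn add(&mut self, canister: &Canister) {
        self.memory_bytes += canister.memory_bytes;
    }

    fn sub(&mut self, canister: &Canister) {
        self.memory_bytes -= canister.memory_bytes;
    }
}

#[derive(Debug, Default)]
pub struct Pools {
    hot: BTreeMap<u64, Canister>,
    cold: BTreeMap<u64, Canister>,
    cold_stats: ColdStats,
}

impl Pools {
    /// Inserts into `hot`, first removing any existing entry (and its cold stats).
    pub fn insert(&mut self, id: u64, canister: Canister) {
        self.remove(id);
        self.hot.insert(id, canister);
    }

    /// Removes from either pool; leaving `cold` subtracts from the aggregate.
    pub fn remove(&mut self, id: u64) -> Option<Canister> {
        if let Some(canister) = self.cold.remove(&id) {
            self.cold_stats.sub(&canister);
            return Some(canister);
        }
        self.hot.remove(&id)
    }

    /// Demotes an inactive hot canister; entering `cold` adds to the aggregate.
    pub fn try_cool(&mut self, id: u64) -> bool {
        let is_idle = matches!(self.hot.get(&id), Some(canister) if !canister.active);
        if !is_idle {
            return false;
        }
        if let Some(canister) = self.hot.remove(&id) {
            self.cold_stats.add(&canister);
            self.cold.insert(id, canister);
        }
        true
    }

    /// O(|hot|): walks only the hot pool and adds the precomputed cold total.
    pub fn total_memory(&self) -> u64 {
        self.cold_stats.memory_bytes + self.hot.values().map(|c| c.memory_bytes).sum::<u64>()
    }

    /// O(|cold|) recompute, the kind of check a debug assertion could compare against.
    pub fn recomputed_cold_memory(&self) -> u64 {
        self.cold.values().map(|c| c.memory_bytes).sum()
    }

    pub fn cold_memory(&self) -> u64 {
        self.cold_stats.memory_bytes
    }
}
```

The subnet total is unchanged by a `try_cool` call in this sketch, because the bytes simply move from the hot-side sum into the cold-side running total.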
…ry. Rename raw_memory to execution_memory, so it better matches the equivalent MemoryTaken field. Update documentation and tests.
A canister can satisfy `CanisterState::is_cold()` while still holding a guaranteed-response slot reservation: `is_cold()` only requires empty input/output *messages* (the pool count) and no unexpired best-effort callback, both of which are independent of whether the canister has in-flight guaranteed-response requests. A canister that has pushed a guaranteed-response request that has already been moved to an outgoing stream still keeps the input-slot reservation for the eventual response, which contributes `MAX_RESPONSE_COUNT_BYTES` to its `guaranteed_response_message_memory_usage()`.
The previous commit dropped this field from `ColdStats` on the assumption that it was always zero. It isn't, and the consequence is that `guaranteed_response_message_memory_taken()` quietly under-reports subnet-wide memory: promoting a cold canister with a reservation to `hot` (e.g. on the next `get_mut`) makes the subnet total jump up out of nowhere, breaking conservation invariants in downstream code (stream handler `debug_assert!`s, in particular).
Restore the field and the corresponding `add`/`sub` bookkeeping, fold it into `guaranteed_response_message_memory_taken`, `total_canister_memory_usage`, and `memory_taken`, and add a focused test (`cold_canister_with_guaranteed_response_reservation_is_aggregated`) exercising the case via `push_output_request` followed by draining the output queue.
Best-effort message memory remains hot-only: an unexpired best-effort callback forces the canister into `hot`, and any expired best-effort callback shows up as a pending input, which also forces `hot`.
Co-authored-by: Cursor <cursoragent@cursor.com>
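The under-reporting can be reproduced in a few lines. In this sketch, `RESERVATION_BYTES` is a made-up placeholder for `MAX_RESPONSE_COUNT_BYTES` (its real value is not taken from the PR), and `Canister`, `reserved_response_slots` and `guaranteed_response_total` are hypothetical names; the `track_cold` flag switches between the buggy and the restored bookkeeping.

```rust
/// Hypothetical per-reservation cost in bytes; a placeholder, not the real constant.
pub const RESERVATION_BYTES: u64 = 1_000;

/// Hypothetical canister that holds only input-slot reservations.
#[derive(Debug, Clone, Copy)]
pub struct Canister {
    pub reserved_response_slots: u64,
}

impl Canister {
    pub fn guaranteed_response_memory(&self) -> u64 {
        self.reserved_response_slots * RESERVATION_BYTES
    }
}

/// Subnet-wide total. With `track_cold == false` the cold pool is assumed to
/// contribute zero, which mirrors the assumption the commit above reverts.
pub fn guaranteed_response_total(hot: &[Canister], cold: &[Canister], track_cold: bool) -> u64 {
    let hot_sum: u64 = hot.iter().map(Canister::guaranteed_response_memory).sum();
    let cold_sum: u64 = if track_cold {
        cold.iter().map(Canister::guaranteed_response_memory).sum()
    } else {
        0
    };
    hot_sum + cold_sum
}
```

With `track_cold == false`, moving the same canister from the `cold` slice to the `hot` slice changes the result even though nothing about the canister changed, which is the kind of jump a conservation assertion would trip over.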
Switches `ReplicatedState::canister_states` from a flat
`BTreeMap<CanisterId, Arc<CanisterState>>` to `CanisterStates`,
exposing the hot/cold partition to the rest of the system and
migrating every caller.
`ReplicatedState` changes:
* `canister_states` field is now `CanisterStates`.
* Drop `canisters_iter_mut()`. Round-level callers move to
`hot_canisters_iter_mut()` (skips the long tail of cold
canisters); bulk callers move to `canisters_for_each_mut` /
`canisters_try_for_each_mut`, which iterate every canister and
re-establish the partition afterwards.
* Add `hot_canisters_iter()` for read-only hot-only iteration.
* Add `repartition_canister_states()`, called from
`StateManager::commit_and_certify` after
`flush_checkpoint_ops_and_page_maps` to drive canisters that went
quiet during the round back into `cold` before checkpointing, so
that replicas continuing through a checkpoint and replicas
(re)starting from it agree on the partition.
* `take_canister_states` / `put_canister_states` now exchange the
`CanisterStates` directly instead of going through a flat
`BTreeMap` round-trip.
* Aggregator delegations: `total_compute_allocation`, `memory_taken`,
`total_canister_memory_usage`,
`guaranteed_response_message_memory_taken`,
`best_effort_message_memory_taken`, `callback_count` now delegate
to `CanisterStates` and run in `O(|hot|)`.
`state_manager`:
* `commit_and_certify` calls `state.repartition_canister_states()`
after `flush_checkpoint_ops_and_page_maps` and before tip
handover.
* `validate_eq_canister_states` calls
`CanisterStates::validate_strict_split` on the reference state to
verify that the persisted partition matches what
`CanisterStates::new` would produce on a fresh load.
* `flush_checkpoint_ops_and_page_maps` and
`switch_to_checkpoint` switch from `canisters_iter_mut` to
`canisters_for_each_mut` / `canisters_try_for_each_mut`.
* Bench: `bench_traversal` likewise.
`execution_environment`:
* `scheduler.rs`: switch to hot-only iteration where appropriate
(`add_heartbeat_and_global_timer_tasks`,
`purge_expired_ingress_messages`, the
`ongoing_long_install_code` check); migrate
`charge_canisters_for_resource_allocation_and_usage` and the
log-memory-store migration loop to `canisters_for_each_mut`.
* `round_schedule.rs`: `partition_canisters_to_cores` now takes /
returns a `CanisterStates`; idle canisters are dropped before the
main hot-canister iteration.
* `query_handler.rs`, `execution_environment.rs`: callers updated.
* `canister_manager/tests.rs`, scheduler tests
(`scheduling.rs`, `metrics.rs`, `dts.rs`, `ecdsa.rs`,
`round_schedule/tests.rs`, `test_utilities.rs`, `tests.rs`)
updated.
* `benches/scheduler.rs`: updated.
`canonical_state`:
* `lazy_tree_conversion.rs`: new `CanisterStatesFork<'_>` that
presents a `CanisterStates` as a `LazyFork` over the merged
hot+cold pools in `CanisterId` order.
`canister_sandbox`:
* `sandboxed_execution_controller.rs`: switch
`evict_sandbox_processes` to per-id `state.canister_state(id)` /
`state.canister_priority(id)` lookups (also enables removing the
bulk `canister_accumulated_priorities` method). This duplicates
the standalone "perf: Look up sandbox scheduler priorities per
canister" PR; whichever lands first, the other becomes a no-op.
`messaging`:
* `stream_handler/tests.rs`: pre-heat `LOCAL_CANISTER` in the
`out_of_memory` reject-signal test so that the expected and
inducted states share the same hot/cold partition.
* `stream_builder/tests.rs`, `state_machine/tests.rs`,
`tests/common/mod.rs`: caller updates.
`replicated_state` queues and system_state:
* `CanisterQueues` / `SystemState` `local_canisters` parameter type
flips from `&BTreeMap<CanisterId, Arc<CanisterState>>` to
`&CanisterStates` (no behavioural change; queues only need
`contains_key`).
* `replicated_state.rs` deletes the now-unused
`canister_accumulated_priorities` method.
`metrics.rs`:
* `check_dts` walks `hot_canisters_iter()` (only hot canisters can
have non-empty task queues).
* `check_subnet_memory_usage` switches to
`CanisterStates::memory_taken()` for `O(|hot|)` aggregation.
`test_utilities` and `state_tool`:
* `test_utilities/execution_environment`, `test_utilities/state`,
and `state_tool/src/commands/canister_metrics.rs` updated to use
the new iteration APIs.
Co-authored-by: Cursor <cursoragent@cursor.com>
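Consumers such as the canonical-state fork need to see the two pools as one map in `CanisterId` order. One way to do that without copying is a two-way merge over the pools' sorted iterators. The sketch below assumes `u64` keys and hypothetical `MergedIter` / `all_iter` names, and it relies on the partition invariant that no key appears in both pools.

```rust
use std::collections::btree_map::Iter;
use std::collections::BTreeMap;
use std::iter::Peekable;

/// Merges two sorted map iterators into one ascending sequence.
pub struct MergedIter<'a, V> {
    hot: Peekable<Iter<'a, u64, V>>,
    cold: Peekable<Iter<'a, u64, V>>,
}

impl<'a, V> Iterator for MergedIter<'a, V> {
    type Item = (&'a u64, &'a V);

    fn next(&mut self) -> Option<Self::Item> {
        // Peek at both heads and advance whichever holds the smaller key.
        let take_hot = match (self.hot.peek(), self.cold.peek()) {
            (Some((hot_key, _)), Some((cold_key, _))) => hot_key <= cold_key,
            (Some(_), None) => true,
            (None, Some(_)) => false,
            (None, None) => return None,
        };
        if take_hot {
            self.hot.next()
        } else {
            self.cold.next()
        }
    }
}

/// Hypothetical analogue of `all_iter`: every entry of both pools, in key order.
pub fn all_iter<'a, V>(
    hot: &'a BTreeMap<u64, V>,
    cold: &'a BTreeMap<u64, V>,
) -> MergedIter<'a, V> {
    MergedIter {
        hot: hot.iter().peekable(),
        cold: cold.iter().peekable(),
    }
}
```

Each step costs a constant number of comparisons, so a full traversal stays linear in the total number of canisters.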
… canisters from one pool to the other; add more tests for is_cold(); misc test additions.
dfinity#10288)
Move the "drop idle canisters with 0-100 AP from the subnet schedule" logic out of the `NextExecution::None` branch of the main per-canister loop and into a dedicated pre-loop at the top of `start_iteration`. Behavior is unchanged: the same set of idle canisters with priorities in the 0-100 AP range get dropped.
Also clarify the doc comment for `IterationSchedule::partition_canisters_to_cores`.
This is a small standalone refactor extracted from dfinity#10287, where the main per-canister loop will switch from iterating all canisters to iterating only hot canisters (at which point hoisting becomes a correctness requirement: cold canisters would otherwise no longer be visited by the main loop and their idle entries would not be dropped).
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: IDX GitHub Automation <infra+github-automation@dfinity.org>
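The shape of the hoisted pre-loop can be sketched as follows. Everything here is an assumption made for illustration: the schedule is modelled as a plain map from a `u64` ID to an integer priority, "0-100 AP" is read as the inclusive range `0..=100`, and the set of active canisters is passed in explicitly instead of being derived from `NextExecution`.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical iteration setup: drops idle entries up front, then returns the
/// IDs that the main loop would visit (active canisters only).
pub fn start_iteration(schedule: &mut BTreeMap<u64, i64>, active: &BTreeSet<u64>) -> Vec<u64> {
    // Pre-loop: this pass sees every schedule entry, so idle canisters are
    // dropped even though the main loop below never visits them.
    schedule.retain(|id, priority| {
        let is_idle = !active.contains(id);
        let in_drop_range = (0..=100).contains(&*priority);
        !(is_idle && in_drop_range)
    });

    // Main loop: visits active canisters only.
    active.iter().copied().collect()
}
```

If the drop lived inside the main loop instead, restricting that loop to active canisters would silently stop removing idle entries, which is the correctness concern the commit message raises.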
…c comments; simplify memory_taken().
…lementation; also apply pub(crate) to best_effort_message_memory_taken() and guaranteed_response_message_memory_taken(), as they are also potentially dangerous to use directly.
…d() test, so that all stats are covered; and for both hot and cold canisters.
 /// Garbage collects empty canister and subnet queues.
 pub fn garbage_collect_canister_queues(&mut self) {
-    for (_canister_id, canister) in self.canister_states.iter_mut() {
+    for canister in self.canister_states.hot_values_mut() {
Claude complains that cold canisters may still have empty queues, which will now no longer be garbage collected.
Great point, thank you.
This is merely an optimization (we don't write an empty `queues.pbuf` file for a canister with no queues), but it's a really nice-to-have optimization. And GC-ing only active canisters' queues would have broken it completely.
 let (_height, mut replicated_state) = self.state_manager.take_tip();
 let mut synthetic_responses = vec![];
-for canister_state in replicated_state.canisters_iter_mut() {
+for canister_state in replicated_state.hot_canisters_iter_mut() {
Claude complains that a canister blocked on an outstanding remote guaranteed-response call is cold (it holds only an input-slot reservation, with no enqueued messages, tasks or unexpired best-effort callbacks), yet its remote callback still needs rejecting.
Good point, fixed.
… ReplicatedState::garbage_collect_canister_queues(); same in StateMachine::reject_remote_callbacks().
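Both review findings come down to the same thing: an operation that must reach cold canisters cannot use a hot-only iterator. A bulk visitor that touches every canister and then re-establishes the partition might look like the sketch below. It is deliberately naive (it rebuilds both maps on every call), and `Pools`, `Canister`, `pending_messages` and `empty_queues` are hypothetical names, not the PR's types.

```rust
use std::collections::BTreeMap;

/// Hypothetical canister: cold when it has no pending messages, but it may
/// still carry empty queues that are worth garbage collecting.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Canister {
    pub pending_messages: usize,
    pub empty_queues: usize,
}

impl Canister {
    fn is_cold(&self) -> bool {
        self.pending_messages == 0
    }
}

#[derive(Debug, Default)]
pub struct Pools {
    pub hot: BTreeMap<u64, Canister>,
    pub cold: BTreeMap<u64, Canister>,
}

impl Pools {
    /// Visits every canister, hot or cold, in key order, then sorts each one
    /// back into the pool that matches its state after the mutation.
    pub fn for_each_mut(&mut self, mut f: impl FnMut(u64, &mut Canister)) {
        let mut all = std::mem::take(&mut self.hot);
        all.append(&mut self.cold); // leaves `self.cold` empty
        for (id, mut canister) in all {
            f(id, &mut canister);
            if canister.is_cold() {
                self.cold.insert(id, canister);
            } else {
                self.hot.insert(id, canister);
            }
        }
    }
}
```

A garbage-collection pass written as `pools.for_each_mut(|_, c| c.empty_queues = 0)` then clears empty queues on cold canisters too, while leaving them in `cold` afterwards.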
[Context: a3ba27a and 7813995.]
Switches `ReplicatedState::canister_states` from a flat
`BTreeMap<CanisterId, Arc<CanisterState>>` to `CanisterStates`,
exposing the hot/cold partition to the rest of the system and
migrating every caller.
`ReplicatedState` changes:
* `canister_states` field is now `CanisterStates`.
* Drop `canisters_iter_mut()`. Round-level callers move to
`hot_canisters_iter_mut()` (skips the long tail of cold
canisters); bulk callers move to `canisters_for_each_mut` /
`canisters_try_for_each_mut`, which iterate every canister and
re-establish the partition afterwards.
* Add `hot_canisters_iter()` for read-only hot-only iteration.
* Add `repartition_canister_states()`, called from
`StateManager::commit_and_certify` after
`flush_checkpoint_ops_and_page_maps` to drop canisters that went
idle during the round back into `cold` before checkpointing, so
that replicas continuing through a checkpoint and replicas
(re)starting from it agree on the partition.
* `take_canister_states` / `put_canister_states` now return / accept
`CanisterStates` instead of `BTreeMap`.
* Aggregator delegations: `total_compute_allocation`, `memory_taken`,
`total_canister_memory_usage`,
`guaranteed_response_message_memory_taken`,
`best_effort_message_memory_taken`, `callback_count` now delegate
to `CanisterStates` and run in `O(|hot|)`.
`state_manager`:
* `commit_and_certify` calls `state.repartition_canister_states()`
after `flush_checkpoint_ops_and_page_maps` and before tip
handover.
* `validate_eq_canister_states` calls
`CanisterStates::validate_strict_split` on the reference state to
verify that the persisted partition matches what
`CanisterStates::new` would produce on a fresh load.
* `flush_checkpoint_ops_and_page_maps` and
`switch_to_checkpoint` switch from `canisters_iter_mut` to
`canisters_for_each_mut` / `canisters_try_for_each_mut`.
* Bench: `bench_traversal` likewise.
`execution_environment`:
* `scheduler.rs`: switch to hot-only iteration where appropriate
(`add_heartbeat_and_global_timer_tasks`,
`purge_expired_ingress_messages`, the
`ongoing_long_install_code` check); migrate
`charge_canisters_for_resource_allocation_and_usage` and the
log-memory-store migration loop to `canisters_for_each_mut`.
* `round_schedule.rs`: `partition_canisters_to_cores` now takes /
returns a `CanisterStates`; idle canisters are dropped before the
main hot-canister iteration.
* `query_handler.rs`, `execution_environment.rs`: callers updated.
* `canister_manager/tests.rs`, scheduler tests
(`scheduling.rs`, `metrics.rs`, `dts.rs`, `ecdsa.rs`,
`round_schedule/tests.rs`, `test_utilities.rs`, `tests.rs`)
updated.
* `benches/scheduler.rs`: updated.
`canonical_state`:
* `lazy_tree_conversion.rs`: new `CanisterStatesFork<'_>` that
presents a `CanisterStates` as a `LazyFork` over the merged
hot+cold pools in `CanisterId` order.
`messaging`:
* `stream_builder/tests.rs`, `state_machine/tests.rs`,
`tests/common/mod.rs`: caller updates.
`replicated_state` queues and system_state:
* `CanisterQueues` / `SystemState` `local_canisters` parameter type
flips from `&BTreeMap<CanisterId, Arc<CanisterState>>` to
`&CanisterStates` (no behavioral change; queues only need
`contains_key`).
`metrics.rs`:
* `check_dts` walks `hot_canisters_iter()` (only hot canisters can
have non-empty task queues).
* `check_subnet_memory_usage` switches to
`CanisterStates::memory_taken()` for `O(|hot|)` aggregation.
`test_utilities` and `state_tool`:
* `test_utilities/execution_environment`, `test_utilities/state`,
and `state_tool/src/commands/canister_metrics.rs` updated to use
the new iteration APIs.